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Abstract 

Background: Systennatic processing noise, which includes batch effects, is very common in microarray experiments 
but is often ignored despite its potential to confound or compromise experimental results. Compromised results 
are most likely when re-analysing or integrating datasets from public repositories due to the different conditions 
under which each dataset is generated. To better understand the relative noise-contributions of various factors in 
experimental-design, we assessed several lllumina and Affymetrix datasets for technical variation between replicate 
hybridisations of Universal Human Reference (UHRR) and individual or pooled breast-tumour RNA. 

Results: A varying degree of systematic noise was observed in each of the datasets, however in all cases the 
relative amount of variation between standard control RNA replicates was found to be greatest at earlier points in 
the sample-preparation workflow. For example, 40.6% of the total variation in reported expressions were attributed 
to replicate extractions, compared to 13.9% due to amplification/labelling and 10.8% between replicate 
hybridisations. Deliberate probe-wise batch-correction methods were effective in reducing the magnitude of this 
variation, although the level of improvement was dependent on the sources of noise included in the model. 
Systematic noise introduced at the chip, run, and experiment levels of a combined lllumina dataset were found to 
be highly dependant upon the experimental design. Both UHRR and pools of RNA, which were derived from the 
samples of interest, modelled technical variation well although the pools were significantly better correlated (4% 
average improvement) and better emulated the effects of systematic noise, over all probes, than the UHRRs. The 
effect of this noise was not uniform over all probes, with low GC-content probes found to be more vulnerable to 
batch variation than probes with a higher GC-content. 

Conclusions: The magnitude of systematic processing noise in a microarray experiment is variable across probes 
and experiments, however it is generally the case that procedures earlier in the sample-preparation workflow are 
liable to introduce the most noise. Careful experimental design is important to protect against noise, detailed 
meta-data should always be provided, and diagnostic procedures should be routinely performed prior to 
downstream analyses for the detection of bias in microarray studies. 



Background 

Increased adoption of high-throughput, whole-genome 
gene expression analysis technologies has led to an 
increased focus on the reliability of the experimental 
measurements they produce. Several recent articles have 
provided substantial evidence that systematic noise 
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introduced as a result of batch-processing have a detri- 
mental effect on data derived from microarrays [1-5]. 
Complete confounding of batches of array scans (even 
in the same laboratory, but at different times) with the 
variable under investigation can completely undermine 
reported results. For example, re-analysis of a study by 
Spielman et al [6] revealed that samples from each of 
their studied populations were processed separately 
(between 1-3 years apart) and that after application of a 
standard batch-correction method [7] none of the genes 
initially reported remained significantly differentially 
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expressed [8]. Several studies have attempted to assess 
reliability and consistency of array measurements and 
estimate potential sources of confounding systematic 
noise in an e ort to reduce these vulnerabilities [9,10]. 
The microarray quality control (MAQC) project was 
established to explore inter-platform and inter-labora- 
tory consistency of microarray-derived gene expression 
datasets using two reference RNA samples [11], as well 
as consistency of differential expression estimates 
[12,13]. These studies reported generally good consensus 
between replicate samples across technologies and 
laboratories, however the latter found the greatest level 
of consistency in results-lists between platforms was 
achieved when probes were ranked by decreasing fold- 
change with a generous p-value cutoff [13]. Despite this 
finding, it is still the case that most published studies 
determine whether or not genes are differentially 
expressed based on the significance of statistical tests. 
Our own previous studies have reported compelling evi- 
dence for the existence of confounding batch effects in 
Affymetrix and lUumina experiments performed in the 
same laboratory using a single platform [14,15]. It has 
also been shown that batch effects can be mitigated 
using per-probe batch-correction methods such as 
mean-centring [14], distance-weighted discrimination 
(DWD) [16], or empirical-Bayes corrections [17]. 

Awareness of and protection from batch effects is of 
particular importance in meta- or re-analyses of existing 
data, collated from multiple sources, where the intention 
is to exploit increased statistical power to detect subtle 
differences in expression between well-defined 



phenotypes. Direct comparison and integration of gene 
expression data through re-analyses is a highly attractive 
option due to data repositories such as the NCBI Gene 
Expression Omnibus (GEO) [18] and Array Express 
hosted by the European Bioinformatics Institute [19]. 
Unfortunately appraising the quality of publicly available 
data is not a trivial task and, even when the submission 
guidelines are followed completely, it can be difficult to 
identify technical issues that may lead to systematic 
noise in subsequent analysis that potentially compromise 
results. It is important therefore to have an appreciation 
of the relative sources of systematic error throughout 
the experiment to prioritise which elements of meta- 
data should accompany published datasets and, subse- 
quently, how to evaluate the experiment design to 
enable reliable re-analysis or integration of published 
data. 

Given previously reported variation in various array 
technologies [14,15,20-23], we wished to quantify the 
relative noise introduced at various stages in the 
experimental process prior to array scanning; including 
the choice of experimental protocol, array-version, and 
study design. In addition, we assess different types of 
control sample in terms of their ability to model batch 
effects and also consider correlations to specific prop- 
erties of array probes. In this study, we present an ana- 
lysis of in-house Affymetrix (Figure lA) and Illumina 
(Figure IB) datasets in addition to those generated as 
part of the MAQC project [11] (Figure IC) in order to 
assess the relative bias introduced in microarray 
results. 
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Results 

Relative sources of systematic noise associated with 
various stages in the Affymetrix experiment workflow 

In order to establish the relative impact of different 
sources of systematic noise we compared the expression 
profiles and numbers of consistently differentially 
expressed genes when compensating for variation intro- 
duced at a single or at multiple steps in the Affymetrix 
GeneChip sample-processing workflow. The dataset, 
previously reported in [14], consisted of MCF7 and 
MCFIOA samples for which technical replicates were 
available at several stages of sample preparation (Figure 
lA). We previously observed that the variability intro- 
duced by different generations of array platform or 
amplification protocol was larger than that attributed to 
the choice of scanner used for image-capture. This was 
also reflected in the number of commonly differentially 
expressed genes across datasets (Eg. MCF7-amplified/ 
MCFlOA-unamplified vs. MCF7-unamplified/MCF10A- 
amplified), which were fewer between platforms/ 



protocols than between scanners. In the current study, 
we found that when batch correction (using ComBat 
[17]) is performed to enable integration of these data- 
sets, compensating for the bias introduced by amplifica- 
tion and labelling resulted in much greater consistency 
in gene-lists (46% and 66% average improvement) than 
correcting for the scanner (11% improvement), see Fig- 
ure 2. Correction for significant sources of systematic 
error is required to improve the likelihood of consis- 
tently identifying differentially expressed genes from dif- 
ferent datasets, but these approaches do not completely 
remove all sources of technical error. 

Variance analysis of this dataset provided a more 
detailed breakdown of the estimated biological and 
technical error introduced at the various stages of sam- 
ple-preparation (Figure 3A). The majority of the pro- 
besets show far greater inter-sample variability than 
between the two cell-lines. Cell-line to cell-line (MCF7 
vs. MCFIOA) variability, mean-averaged over all probe- 
sets, contributed 9.6% to the total standard deviation; 
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Figure 2 Relative comparison of different batch effects in Affymetrix GeneChip data. LEFT, Comparison of the numbers of differentially 
expressed genes in common between MCF7 and MCFIOA triplicate samples depending upon what variables were included in the model used 
for the ComBat correction ('amp' = amplification, 'lab' = labelling, and 'sen' = scanner). Red points are mean counts, error bars are standard 
deviations, and significance of increased counts compared to no ComBat correction are indicated ('**' for p < 0:05 and '™ for p < 0:01, based 
on two-tailed t-tests). RIGHT, Pairwise Pearson correlation heatmaps of MCF7 and MCFIOA samples compensating for 1, 2, or 3 sources of batch 
effect using ComBat. Green and blue colour-bars denote MCF7 and MCFIOA samples, respectively. The lightest colours denote un-amplified 
samples, slightly darker are amplified samples, and the darkest colours are the scanner/labelling comparison. A, all data treated as a single group 
without batch-correction; B, batch-correction for amplification; C, batch-correction for amplification and an alternative labelling method; D, batch 
correction for amplification, labelling and different scanners used. 
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Figure 3 Comparison of variance components in Affymetrix and lllumina data. Comparison of MAQC and Paterson Affymetrix variance 
components. A: Probe-wise SD-estimates corresponding to several levels of technical variance (see figure key) in detection-filtered and quantile 
normalised MCF7/MCF10 expressions from the Paterson Affymetrix dataset. B: Probe-wise estimates of standard deviations (SD) corresponding to 
the inter-experiment (light blue), inter-run (dark blue), and inter-chip (green) technical variance in our UHRR lllumina data. The effect on these 
standard deviations following detection-filtering (DF), quantile-normalisation (QN), ComBot batch-correction by experiment (CB(expt)), and 
ComBot batch-correction by run (CB(run)) are shown. C: Probe-wise SD-estimates corresponding to inter-laboratory (pale blue), inter-chip (green), 
and inter-array (dark green) technical variances in detection-filtered and quantile normalised UHRR/UBRR expressions from the MAQC lllumina 
dataset. D: Probe-wise SD-estimates corresponding to inter-laboratory (pale blue) and inter-chip (green) technical variances in detection-filtered 
and quantile normalised UHRR/UBRR expressions from the MAQC Aff/metrix dataset. 



compared to 40.6% due to inter-sample variability, 
13.9% due to amplification/labelling, 9.9% & 10.8% for 
inter array and inter-scanner, and 15.2% at the within- 
scanner/residual level. The arrays were preprocessed 
using the popular RMA algorithm for background cor- 
rection followed by quantile normalisation (see meth- 
ods), however not performing any background 



correction resulted in a similar relative variance pro le 
but noticeably reduced absolute variances (Additional 
File 1). Either with or without background correction, 
this error pro le is not at all dissimilar to error esti- 
mates obtained from the various stages of sample pre- 
paration prior to a qPCR experiment in which the 
sampling step is usually by far the most variable while 
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the noise introduced during reverse-transcription is 
generally low, but can sometimes be larger than sam- 
pling error [24]. 

Technical processing noise across lllumina Beadarray 
chip, runs, and experiments 

Variance estimates were calculated for each of 15,757 
probes at the inter-experiment, inter-run, and inter-chip 
level using a nested analysis of variance described in 
methods (Figure IB, Figure 3B, and Additional File 2). 
The inter-experiment variance was the parameter with 
the greatest source of measurement noise, accounting 
for an average 61.7% of the total variation in reported 
gene expression. This is likely to be due to the use of a 
fresh round of amplification and labelling performed on 
the UHRR samples between experiments. Despite an 
overall reduction in the technical variation between 
probes, the fraction of the total variance contributed by 
each level (inter-chip, -run, and -experiment) was unaf- 
fected by standard, array-wise, quantile normalisation. 
Specific, probe-wise, batch correction greatly reduces 
the technical variation due to experiment and run (Fig- 
ure 3B). The slight effect due to the within-batch var- 
iance moderation can be measured in the reduction in 
inter-chip SD after either run of ComBat, The high 
inter-chip variation, compared to inter-run variation, 
appears to be driven by the new samples run on the 
HT-12 chips as this was not observed in the standalone 
analyses of the Ref-8 data alone. As with the Affymetrix 
data we explored the effect of background correction on 
these variances (Additional File 1); the practice of cor- 
recting for background signal in lllumina arrays is not 
as common as when processing Affymetrix data so we 
compared variances between preprocessing using quan- 
tile normalisation with no background correction to 
those after using the neqc function in limma package, 
which uses the various negative control probes on an 
lllumina array to compensate for background signal 
prior to quantile normalisation. As with the Affymetrix 
arrays it is clear that performing additional background 
correction has little effect on the relative variance struc- 
tures but generally inflates the magnitudes of the esti- 
mated variances. 

A similar variance analysis was also performed, for 
comparison, using the MAQC lllumina Human-6 
Expression BeadChip (vl) dataset [11]. Expression levels 
for 47,293 probes were subjected to the same detection- 
filtering criteria as our Ref-8 and HT-12 chips and the 
expressions of 21,896 surviving probes were quantile 
normalised prior to variance analyses. The design of the 
MAQC experiment only allowed for variance compo- 
nents to be estimated for the inter-laboratory, inter- 
chip, and inter-array levels. However, compared to the 
Ref-8/HT-12 inter-experiment standard deviations, the 



MAQC inter-laboratory standard deviations were much 
smaller; less than half as much on average (Figure 3C). 
This is likely to be a result of our use of different array- 
versions and widely different dates on which the arrays 
were processed (about 2 years). However it is note- 
worthy that the majority of the variance in the MAQC 
dataset was attributable to the intra-chip (inter-array) 
level and this is the only dataset, for which variance ana- 
lyses were performed, in which this phenomenon holds. 
Variance analysis of the MAQC Affymetrix U133 
Plus2.0 dataset (using 23,053 probes reported as 'pre- 
sent' across at least 80% of the samples) revealed a more 
familiar variance pattern, similar to that observed in our 
experiments, in which the majority of the variation is 
directly attributable to systematic differences between 
the different laboratories performing the experiments 
(Figure 3D). The estimated standard deviations at the 
inter-laboratory level in these data were similar to those 
observed at the inter-experiment level in the combined 
Ref-8/HT-12 data, as were the distributions of the coef- 
ficients of variation (standard deviation normalised by 
mean expression; data not shown). 

The Pearson-correlation heatmaps in Figure 4 demon- 
strate that the majority of the variation between 




Figure 4 Correlation heatmap of all replicate UHRR pairs. 

Heatmap of Pearson correlations between replicate pairs of UHRR 

samples highlights the inter-experiment, inter-run, and inter-chip 

differences; particularly at the inter-experiment level. Red cells 

correspond to -80% correlation and white to 100% correlation. 

Batches and sample numbers are consistent with the colouring and 

labelling in Figures 1 and Additional File 2. A: detection filtered (DF); 

B: DF & quantile normalised (QN); C: DF & QN &Com5c7t(by 

experiment); D: DF & QN UlomBotiby run). 
^ ) 
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replicate UHRR samples in our Ref-8/HT-12 data was 
introduced as a result of the three separate experiments 
(see methods and Additional File 2). However, some- 
what surprisingly, the poorest correlations exist between 
the two experiments involving the HT-12 chips (85.6% 
between runs 6/7 and run 8) as opposed to (90.4% 
between runs 1-5 and run 8). As previously observed, 
standard normalisation techniques based on array-wide 
intensity distributions, such as by quantiles or splines, 
do little to improve probe-wise correlations between 
replicate samples [15]. There is also an apparent band 
of poorer correlation corresponding to a single HT-12 
chip in run 7 (chip 22). For consistency with the control 
samples hybridised to the Ref-8 chips, two replicates of 
the original UHRR were retrieved from storage and 
added to one chip in each run with the HT-12 chips in 
experiment 2. However, only a marginal improvement 
was observed in the correlation of these old UHRR sam- 
ples compared to the freshly amplified and labelled 
replicates (Cl-18 vs. Czl-Cz2 = 92.8% compared to Cl- 
18 vs. C19-22 = 91.2%). Although, again, the sample on 
chip 22 was more poorly correlated with the UHRR 
samples on the Ref-8 chips than that hybridised to chip 
20, despite this chip appearing normal in all standard 
QC checks and internal controls except, perhaps, for 
slightly smaller bead-level standard errors compared to 
the other chips (see methods). 

All UHRR data were subjected to batch-correction 
using ComBat and correction at two levels was assessed: 
experiment-wise and run-wise correction. Following 
ComBat correction by experiment, all pairwise correla- 
tions increased to approximately the same level observed 
in the quantile-normalised Ref-8 data obtained from the 
first experiment (Figure 4C). However, we found pre- 
viously that despite high correlation between UHRR 
replicates the consistency between lists of statistically 
significant differentially expressed probes from duplicate 
sets of samples was poor without specific batch-correc- 
tion [15]. Following ComBat correction by run, correla- 
tions in these new UHRR samples approach those 
observed in the original, ComBat batch-corrected UHRR 
samples from the first experiment (Figure 4D). 

Inter-batch calibrators: Comparing UHRR with pools of 
tumour sample RNA 

Two controls of pooled clinical breast-tumour RNA 
were run on each of the four BeadChips in experiment 
2 (samples PI through P8 in Additional File 2). One 
pool was created from a mix of all seven pre-treatment 
samples run in this experiment and the second pool was 
created using all seven post-treatment samples (see 
methods). Results of pairwise Pearson-correlations 
between these pooled samples identified a large 



difference between chip 22 and the other three chips 
used in this experiment (Figures 5A &5B), consistent 
with that seen with the UHRR. No obvious differences 
were observed in the correlations between the different 
pools composed of either pre- or post-treatment RNA. 
In general, the correlations appeared similar to those 
observed between replicate UHRR samples illustrated in 
Figure 4. As noted before, quantile normalisation does 
little to remedy the poor correlation between chip 22 
and the other chips, but this is remedied with ComBat 
by treating the batches either as runs (Figure 5C) or, 
slightly better, as separate chips (Figure 5D). 

Compared to the UHRR samples in experiment 1 (Fig- 
ure 3B), variance components estimated by nested-anova 
between the pooled tumour samples (Figure 5) in 
experiment 2 both show a large inter-chip/intra-batch 
effect. The larger magnitude of variation supports the 
increased inter-chip component in UHRR and it is clear 
from both the correlations and the variance estimates of 
both the UHRR and tumour pools that the variance in 
the HT-12 data is driven by a particular chip (chip 22), 
rather than between runs as was the case in the Ref-8 
arrays. 

Gene expression in this small group of samples was 
rather varied, resulting in only 5 probes significantly dif- 
ferentially expressed following treatment that also sur- 
vived multiple-testing correction at q < 0.05; the same 5 
probes appeared in every results list regardless of nor- 
malisation or batch-correction. Despite the apparent low 
level of biological variation between the pre- and post- 
treatment samples and small number of significantly dif- 
ferentially expressed genes, both the UHRR and the 
pools accurately reproduced the fold-changes between 
duplicate tumour samples across the two runs (Addi- 
tional File 3). We calculated the differences in expres- 
sion across the two batches for all duplicate pairs of 
tumour samples and also between the UHRR/pool repli- 
cates. The correlation between differences observed in 
the tumour duplicates and either the UHRR or pool 
controls increased with the magnitude of the difference, 
although both the pre-and post-tumour pools were 
found to be significantly better correlated compared to 
UHRR (on average by 3.9% and 4.3%, using one-sample, 
two-tail t-tests p < 5.9 10'^ and p < 3.4 10'^, respec- 
tively). We also performed an exhaustive comparison of 
the fold-change in expression due to the different runs 
between the replicate pools of pre-treatment tumour 
RNA and each of the fold-changes between duplicate 
tumour samples (Additional File 4). In many cases there 
was strong agreement between absolute expression dif- 
ferences observed in the pool and the individual tumour 
duplicates, however such agreement is not necessary in 
order for the effect to be removed by ComBat, or 
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Figure 5 Correlation heatmap of all replicate pool pairs. RIGHT: Heatmap of Pearson correlations between replicate pairs of pooled-tumour 
control samples highlights the inter-run and inter-chip variation; particularly at the inter-chip level. Red cells correspond to -96% correlation and 
white to 100% correlation. Batches and sample numbers are consistent with the colouring and labelling in Figures 1 and Additional File 2. A: 
detection filtered (DF); B: DF & quantile normalised (QN); C: DF & QN 8^ComBot{by run); D: DF & QN 8^ComBot{by chip). LEFT: Variance estimates 
at various levels in replicate pooled-controls hybridised to the HT-12 chips used in experiment 2. Highlights the effect of various normalisation 
procedures on these variance estimates; as before, such procedures include detection-filtered (DF), quantile normalised (QN), ComBot corrected 
by run (CB(run)), and ComBat corrected by BeadChip (CB(chip)). 



alternative correction methods, as the absolute magni- 
tude of the effect is explicitly normalised during the cor- 
rection procedure. 

Properties of the probes with respect to batch variation 

In order to assess if batch effects are uniform across all 
genes, we investigated whether the properties of the 
probes themselves were affected by different sources of 
systematic error (see methods). Several descriptive sta- 
tistics were selected to serve as measures of probe- 
and mapping-specific properties that could conceivably 
influence probe expression [25-28]. These included 
compositional properties such as the guanine and cyto- 
sine (GC) nucleotide content as well as mapping prop- 
erties such as the position of the probe as a fraction of 
the total length of the target gene, the number of 
(known) transcripts consecutively probed, and the 
average number of exons within the probed gene; with 
the number of known transcripts and number of exons 
acting as proxies for gene complexity and size, 
respectively. 



The distribution of GC-content over the Illumina HT- 
12 probes (Figure 6A-top) clearly shows that they have 
been designed with a bias such that the nucleotide dis- 
tribution of the probes themselves favour a greater- 
than-random fraction of GC-nucleotides. The GC-frac- 
tion distribution of the 12,042 probes used in the analy- 
sis of run-induced fold change is plotted as a Gaussian- 
smoothed probability density in Figure 6B-top. Also 
plotted is the distribution of the subset of these probes 
found to be more than 2-fold up- or down-regulated 
due to the batch effect in any of the 14 duplicate sample 
pairs. These two distributions are very similar, suggest- 
ing that a large number of these technical effects are 
random fluctuations independent of GC-content. How- 
ever the distribution of any probe found to be more 
than 2-fold up- or down-regulated due to the batch 
effect in at least 6 of the duplicate sample pairs was sig- 
nificantly biased towards a lower GC-fraction (Figure 
6B-top; p-value < 2.2 10'^^). Taking this further, we 
correlated probe GC-fraction with error estimates from 
the Ref-8/HT-12 UHRR replicates (Figure 3B), revealing 
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Figure 6 Batch effect and probe GC content. TOP: A, histogram of GC content over our lllumina HT-12 probes; B, gaussian-smoothed 
probability density distribution (blacl<, N = 48,803). Union of probes more tlian 2-fold up- or down-regulated due to the batch effect in any of 
the 14 duplicate tumour-sample pairs from Additional File 4 (green line, N = 2,661). Union of probes more than 2-fold up- or down-regulated 
due to the batch effect in more than 5 of the duplicate tumour pairs (red line, N = 207). BOTTOM: Plots of probe CG-fraction against standard 
deviation estimated at the inter-experiment (A), inter-run (B), and inter-chip (C) levels in our combined lllumina Ref-8/HT-12 dataset. Red lines 
denote the cutoff s used in chi-squared analysis at each level. 



Kitchen et al. BMC Genomics 201 1, 12:589 
http://www.biomedcentral.eom/1 471 -21 64/1 2/589 



Page 9 of 14 



highly significant enrichment for higher standard devia- 
tions (SD) at lower GC-fractions (< 0.55) at the inter- 
experiment (Figure 6A-bottom; pVal < 5.3 10'^) 
and inter-run (Figure 6B-bottom; pVal < 2.2 10"^^) 
levels, but not at the inter-chip level (Figure 6C-bot- 
tom). No significant association was observed between 
low-signal probes and GC-content, however a small but 
significant enrichment for highly-signal-low-GC probes 
was found (Additional File 5). 

We also compared probe GC content with our var- 
iance estimates from the MAQC Illumina and Affyme- 
trix datasets, both of which have a similar overall 
distribution of probe-GC content, skewed in favour of 
higher GC fraction (data not shown). A significant 
enrichment for low-GC-high-SD probes was again 
observed, both at the inter-laboratory level (x^ pVal < 
9.8 10'^) and at the inter-chip level (x^ pVal < 2A 
10'^). As with the Ref-8/HT-12 data, there was no sig- 
nificant enrichment at the lowest level between replicate 
arrays. To determine whether this low-GC-high-SD 
association was resolvable between biological as well as 
technical replicates, we chose a subset of pre-and post- 
treatment samples from our Illumina Ref-8 dataset 
(these were pre-treatment biological replicates from run 
3 and non-duplicated post-treatment biological repli- 
cates from run 5 to avoid confounding due to inter-run 
variation). Despite a slightly greater number of probes 
with low-GC-high-SD in both the pre- and post-treat- 
ment sample-sets, the trend was not significant (data 
not shown). 

The subset of pre-treatment biological replicates from 
run 3 and non-duplicated post-treatment biological repli- 
cates from run 5 were also used to test for any associa- 
tion between probe-mapping properties and batch 
effects. For each probe, the standard deviation between 
the pre-treatment samples and, separately, the post-treat- 
ment samples were assessed against probe position in the 
target gene, number of concurrently probed transcripts, 
and the number of exons in the target gene. The distribu- 
tion of probe-location as a fraction of gene length among 
the Ref-8 probes is shown in Additional File 6. After 
adjusting for the strand of the target, there was no signifi- 
cant association for increased standard deviations of 
probes proximal to either the 3' or 5' end. To test 
whether probe-mapping may correlate with variation 
between replicate cRNA-syntheses, a similar analysis was 
performed using the MAQC Illumina dataset. These 
probes have a distribution similar to the Ref-8 arrays and 
the analyses again revealed no significant enrichment for 
estimated standard deviation due to probe position 
(Additional File 7). It is therefore likely that such probe- 
mapping effects can only be resolved when probes target 
different regions of the transcript, rather than for probes 



targeting the same region being affected by biological or 
technical variability. 

Discussion 

Microarrays represent a powerful means of rapidly 
assessing genome-wide expression patterns. Unfortu- 
nately confounding technical variation and systematic 
error in array technologies presents a major obstacle to 
their adoption for clinical diagnostics in humans. These 
factors are rarely documented, poorly understood, and 
their implications for experimental and clinical utility of 
microarrays frequently ignored [1,4]. Building on pre- 
vious investigations of technical variation between repli- 
cate RNA samples from breast tumour biopsies, this 
extended study used both Illumina and Affymetrix 
arrays to explore the reliability of reported expressions 
across a variety of experiment designs. Re-analysis of 
our MCF7 and MCFIOA Affymetrix datasets demon- 
strated that not compensating for one batch effect, such 
as the use of a different scanner (Eg. if the information 
was not available) would have a much smaller effect on 
the numbers of genes identified as commonly differen- 
tially expressed, than another batch effect (such as the 
labelling method or if RNA was amplified). 

Using a large combined dataset of conserved, reliably- 
detected probes on Illumina Ref-8 and HT-12 BeadChips 
(experiments 1 through 3) we found that the correlation 
between replicate UHRR hybridisations were consistently 
poorer than correlations previously reported using the 
Ref-8 data alone [15]. Interestingly, we found that 
labelled UHRR samples from the original experiment, 
which were stored at -80°C for approximately two years, 
hybridised to two arrays on the new HT-12 chips corre- 
lated better with the original Ref-8 samples than did 
freshly prepared UHRR replicates. This suggests that 
even long periods of frozen storage and additional freeze- 
thaw cycles introduce less noise into experimental mea- 
surements than that inherent in creating a new prepara- 
tion of labeled cRNA, even from the same RNA source. 

As in our previous investigation [15], quantile normali- 
sation did little to improve correlation between the 
UHRR replicates across the Ref-8/HT-12 dataset. How- 
ever specific batch-correction using ComBat [17] once 
again significantly improved correlations and is a valuable 
tool for removing systematic error introduced between 
experiments and/or processing runs. The inter-chip var- 
iation in the new HT-12 datasets was almost double what 
it was in the Ref-8 dataset and due to this increase in 
inter-chip variation and high-levels of inter-experiment 
variation, the inter-run variation in the combined dataset 
was largely obscured. However, as we have previously 
seen, inter-experiment and inter-run variances were lar- 
gely eliminated following ComBat corrections. 
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Variance estimates using the MAQC Illumina dataset 
were smaller in magnitude to the variances obtained 
from our Ref-8 data, revealing a surprisingly high level of 
reliability between the three laboratories that performed 
these experiments. In contrast, the MAQC Affymetrix 
dataset was found to be far more variable than their Illu- 
mina data but similar to the magnitudes of variance 
observed in both our combined Ref-8/HT-12 dataset and 
our Affymetrix dataset. The reason for the low variation 
in the MAQC Illumina data is unknown, especially since 
their study design deliberately split the sample replicates 
before cRNA synthesis; a much earlier stage in the sam- 
ple-prep workflow than our replicates (which were split 
after amplification and labelling). It is possible that the 
small number of laboratories (three, in total) performing 
the MAQC Illumina hybridisations produced highly con- 
cordant data completely by chance, while the larger num- 
ber of laboratories (six, in total) performing the 
Affymetrix experiments provided a more realistic reflec- 
tion of the technical variation in these data. 

Pooled sample vs. UHRR as batch-effect calibrators 

Several studies have found the use of replicate control 
samples such as UHRR to be a useful standard in micro- 
array experiments, suitable for monitoring expression 
consistency within and across a variety of genome-wide 
expression platforms [29-32]. However, such commercial 
controls are deliberately generic and deficiencies have 
been reported in terms of how well they represent speci- 
fic cell types [33]. Clearly UHRR is not representative of 
breast tumour RNA and therefore carries no guarantee 
of expressing RNAs that may be variably expressed in 
the specific subset of genes changed in breast tumour 
tissue. Therefore, in terms of compensating for con- 
founding technical variation, the very probes for which 
the correction is most important are those that are most 
neglected in the UHRR controls. 

Unfortunately the relatively small degree of legitimate 
biological differential expression between the pre-and 
post-treatment tumour biopsies provided little opportu- 
nity to assess the relative performance of UHRR and 
pooled batch calibrators on the consistency of reported 
differentially expressed probes. However, compared to 
the UHRR, the pooled tumour RNA controls were 
shown to more faithfully emulate the individual shift in 
expression between tumour technical-duplicates as a 
result of variation introduced between runs and between 
chips. Had it been possible to identify more legitimately 
differentially expressed probes between pre- and post- 
treatment samples, the pooled RNA would almost cer- 
tainly have made for a better batch-calibrator during 
ComBat correction than the UHRR controls. If a similar 
pooled calibrator was used in our previous study it 
seems reasonable to speculate that the consistency 



between the gene-lists reported as significantly differen- 
tially expressed would have been noticeably higher than 
the 74.1% achieved using UHRR calibrators. 

Batch effects relative to probe position and composition 

We took the opportunity to assess compositional prop- 
erties of the probes as a potential explanation/surrogate 
for the technical effects observed in the Ref-8/HT-12 
data. A highly significant trend in favour of low-GC 
content was identified in the core set of probes consis- 
tently affected by inter-run and inter-chip variation 
between sample duplicates. A similar, but less signifi- 
cant, enrichment for low-GC-high-SD probes was also 
observed in the MAQC Illumina dataset. This suggests 
that the magnitude of error introduced due to low 
probe GC-content is sufficiently great that it is resolva- 
ble between the replicate cRNA preparations assessed in 
the MAQC study. A similar observation regarding probe 
GC-content and expression consistency was recently 
reported in a comparison of RNA preservation proto- 
cols, using matched samples, in terms of the effect on 
results of downstream expression analyses [34]. A 
further correlation of probe composition, specifically 
with respect to GC-content, has been reported pre- 
viously in a spike-in experiment using Illumina Bead- 
Chips [35], in which it was found that probes with high- 
GC content tended to have a higher than expected sig- 
nal intensity, but probes with lower than average GC 
content had inflated differential expression statistics. We 
found no such association between low-expression and 
GC-content, however we did observe a low-signal-low- 
GC association; therefore the notion of inflated differen- 
tial expression stats for low-GC probes is supported not 
by probe intensity, but through greater variability in our 
data. The low-GC effect is likely to be related to ther- 
modynamic properties of hybridisation favouring high- 
GC probes/targets, a supposition that is rational given 
the deliberate high-GC bias in the design of the Illumina 
probesets. Probes with low GC-content appear to be 
inherently more vulnerable to systematic error but, 
although highly statistically significant, the magnitude of 
this variation in our data was small relative to that 
between biological replicates. It is therefore somewhat 
unlikely that such variation would pose a threat to the 
accurate classification of samples and, even in an experi- 
ment in which groups of biological replicates are poorly 
distributed across chips and runs, is also unlikely to be 
a serious confound to statistical tests for differential 
expression. 

The proximity, with respect to the target transcript, of 
probes has been reported to strongly influence the cor- 
relation of expression measurements between technolo- 
gies [36]. The analyses performed here were designed to 
assess whether such probe-transcript mapping 
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influenced expressions reported by the same platform, 
however no such correlation was observed either 
between biological or technical replicate samples. A 
more thorough analysis of the MAQC datasets would 
provide further insight into any relationship between 
probe-location and expression between a variety of dif- 
ferent platforms and sample-preparation procedures. 

Conclusions 

The key source of systematic error in any given micro- 
array experiment is unpredictable, but can generally be 
attributed to RNA extraction and, to some extent, label- 
Ung and amplification protocols. The results presented 
here, including those derived from external data sources, 
suggest that it is not recommended to analyse individual 
test samples (e.g. to then try and classif)^), but instead to 
run several at the same time to get a better estimate of 
the experimental variation in order to be better 
equipped to compensate for it. 

Sound experimental design is of critical importance to 
avoid confounding systematic variation. Randomisation 
of samples over arrays on each BeadChip and careful 
blocking of samples from each group of interest within 
and across runs is necessary to protect against con- 
founding systematic error. In situations where this is 
not possible, detailed meta-data should be preserved for 
each array that includes, at the very least, the date and 
time of each hybridisation and scan. In either case, diag- 
nostic procedures such as PCA or SVA should be routi- 
nely performed prior to downstream array analyses. 

Methods 

Samples and Arrays 

RNA was extracted using the RNeasy Mini Kit, includ- 
ing RNase-Free DNase treatment (Qiagen, Crawley, Uni- 
ted Kingdom) from paired breast tumour tissue, 
alternatively a Universal Human Reference RNA 
(UHRR; Stratagene, Stockport, United Kingdom) was 
used. RNA was amplified and biotinylated using Illu- 
mina TotalPrep RNA Amplification Kit (Ambion) and 
quantified on an Agilent 2100 Bioanalyser. 750 ng 
cRNA per sample was hybridised to lUumina Human 
Ref-8 v2 or Human HT-12 v3 expression BeadChips 
(lUumina, Cambridge, United Kingdom) using the 
Whole-Genome Expression Direct Hybridisation kit 
(lUumina) and scanned with a BeadStation 500GX (lUu- 
mina). Duplication of the clinical samples was per- 
formed after labeUing and labeUed samples were stored 
as per the manufacturer s recommendations. 

In addition to previously described samples [15,37], 
four lUumina HT-12 (v3) BeadChips were used in a fol- 
low-up experiment, described as 'Experiment 2' (Figures 
IB and in Additional File 2). These chips were pro- 
cessed in pairs over the course of two days (referred to 



as runs). A single UHRR replicate, from a fresh prepara- 
tion, was hybridised to each chip ('C19'-'C22'). Two 
replicates of the original UHRR from the original study 
were retrieved from storage at -80°C and one added to a 
chip on each run ('Czl' and 'Cz2'). Pools of pre-treat- 
ment and post-treatment tumour RNA were created 
from the clinical samples and each pool was then hybri- 
dised to each chip. Three further lUumina Human HT- 
12 (v3) BeadChips were processed over the course of a 
single day, described as 'experiment 3' (see Additional 
FUe 2). Additional UHRR samples were included in two 
groups of repUcates corresponding to different labeUing 
protocols; the first group of samples was obtained from 
the same amplification and labelling (Ambion) as the 
UHRR samples used in experiment 2 ('C23'-'C25'). The 
second group of samples was labeUed using the NuGen 
amplification kit (NuGen) ('Cnl'-'Cn4'; Figure Addi- 
tional File 2). The remaining samples on these Bead- 
Chips were not analysed within this study. 

Methods for generating the MCF7 and MCFIOA tri- 
plicate Affymetrix U133A data can be found in the ori- 
ginal publication [14]. Methods for the MAQC lUumina 
Human-6 Expression BeadChip (vl) and Affymetrix 
U133 Plus2.0 array hybridisations are provided in [11]. 
All raw gene expression les and clinical annotation gen- 
erated in this study are publicly available from the 
caBIG supported Edinburgh Clinical Research FacUity 
Data Repository https://catissuesuite.ecmc.ed.ac.uk/caar- 
ray/ and on request to the corresponding author. 

Statistical Methods 

Quality control analyses revealed no systematic or indi- 
vidual problems with the Affymetrix or lUumina data at 
the array level as assessed by the array-level distribu- 
tions of signal intensities (Additional File 8). For the 
lUumina Ref-8 and HT-12 arrays we also assessed the 
noise distributions between replicate beads as large dif- 
ferences in measurement precision might cause pro- 
blems with the interpretation of results derived from the 
combined datasets; however both the bead standard 
errors and bead-representation on both lUumina chip- 
types were very similar (Additional File 8). 
Handling of lllumina data 

All lllumina array scans were initially processed using 
lUumina's BeadStudio software using default parameters, 
including removal of local background signal in the cap- 
tured images but without background normalisation. 

To perform reliable correlation and variance analyses 
simultaneously on both lllumina chip-types, we identi- 
fied and retained only the 15,757 probes with exactly 
conserved sequences between Ref-8 and HT-12 that 
also mapped uniquely to the genome. Probes were re- 
mapped to the human genome (NCBI build 37) using 
Bowtie [38] (v. 0.12.7) allowing for no mismatched 
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bases in the alignment. Alignment annotation, for 
example the position of the probe within the host 
gene, was provided by in-house software and the NCBI 
RefSeq annotation database [39]. The 309 probes com- 
mon to both the Ref-8 and HT-12 chips that aligned 
to the genome but fell within intergenic regions (or 
those that, for whatever reason, were not covered by 
the RefSeq annotation) were considered good and 
retained, along with the 15,448 annotated intragenic 
probes, for further analyses. All correlations, unless 
otherwise stated, are based on Pearson's product- 
moment and are reported as percentages where 100% 
is a perfect correlation. 

These 15,757 uniquely mappable probes, represented 
on both Ref-8 and HT-12 chips, were further filtered 
(unless stated otherwise) using the detection confidence 
reported by BeadStudio- determined for each probe 
based on the expression of the internal control probes, 
local background intensity, and the uniformity of the 
reported intensity of the bead. The filtering was per- 
formed prior to quantile normalisation such that probes 
with a detection confidence less than or equal to 95% in 
more than 20% of the samples were removed from 
further analysis. 

All lUumina data were detection-filtered and normal- 
ised together (8,948 of the conserved probes passed the 
detection-filter) before assessment by pairwise Pearson 
correlation. Results of pairwise Spearman rank correla- 
tions following separate filtering and normalisation of 
the Ref-8/HT-12 chips were very similar (data not 
shown). The globally filtered/normalised data were used 
in all subsequent UHRR analyses for consistency with 
our previous results and for a more reliable interpreta- 
tion of nested variance analyses. 
Handling of Affymetrix data 

Affymetrix data were processed in R using functions con- 
tained within the affy package. Expression values were 
produced from probe in intensities using the expresso 
function with, unless stated otherwise, RMA background 
correction, quantile normalisation, and median-polish 
summarisation over probesets. Following this preproces- 
sing, probes were filtered such that those reported 
'absent' in more than 20% of the samples were removed. 
Variance estimation 

We applied a linear additive model to the Affymetrix and 
lUumina expression data on the log-scale to estimate the 
specified variance contributions. These contributions are 
assumed to be independent and randomly drawn from 
log-normal distributions. As all factors meet in unique 
combinations a nested, or hierarchical, variance model is 
individually applied for each gene. Models of this kind 
are formally defined in [40] and have previously been 
used in the context of gene-expression experiment design 
[24,41]. Variance estimates in all analyses described 



herein were performed using the REML procedure imple- 
mented in the nlme package in R [42-44]. In all mixed 
models the biological variables such as different cell-lines 
in the Paterson dataset and the UHRR/UBRR dilutions in 
the MAQC dataset were treated as fixed effects and all 
downstream sample-processing levels treated as random 
effects. The percentages reported in reference to the var- 
iance structures, for example in Figure 3, were calculated 
based on the mean standard deviation, of all linear mod- 
els over all probes, attributed to each level. 
Batch correction 

The main difference between standard array normalisa- 
tion and batch-correction is that the latter does not 
make the assumption that all probes are affected equally 
by batch-effects and, as such, performs individual adjust- 
ments on each probe across all samples. The ComBat 
approach, as discussed in detail in [17], is a somewhat 
involved process, but can essentially be summarised in 
three stages. First, using a method similar to autoscaling, 
the data over all genes are standardised so that they all 
have similar mean and variance. This step compensates 
for the different expressions and variations of the var- 
ious probes/genes that would otherwise bias the batch 
effect estimates. Next the mean and variances of all 
samples in each batch, over all probes/genes, are esti- 
mated using a linear model and these parameters consti- 
tute the first prior-distribution. Next, independently for 
each gene, the sample mean and variance is estimated 
for each batch and are used to estimate the parameters 
of the additive and multiplicative noise distributions 
using the method of moments; these constitute the sec- 
ond prior-distributions. Finally, using the parameter esti- 
mates for these two prior distributions along with Bayes 
theorem, posterior distributions for each of the additive 
and multiplicative noise distributions are calculated; 
final values of these batch effect parameters are esti- 
mated as the expected values of the posterior distribu- 
tions. This empirical Bayes procedure allows 
information from all genes to be used to estimate batch 
effects for each gene, providing more stable estimates 
than the standard sample mean and sample variance. 
Differential expression 

Gene expression changes were compared before and 
after treatments and between responders and non- 
responders using Bioconductor [45] algorithms imple- 
mented in the statistical programming language, R 
(v.2.12.1) [46]. lUumina probe pro le expression data 
were normalised by quantiles and corrected for batch 
processing effects using ComBat [17]. Genes differen- 
tially expressed between paired pre- and post-treatment 
samples were identified using limma (v.3.6.9) [47] and 
SAM (v. 1.28) [48]. For the analysis using limma, empiri- 
cal Bayes variance shrinkage was employed and genes 
were defined as being differentially expressed after 
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satisfying a minimum fold-change of ± 1.5 and a maxi- 
mum, Benjamini-Hochberg adjusted, p-value of 0.01. 
For the SAM analysis, the differentially expressed genes 
were selected, following 100 permutations, at a maxi- 
mum predicted false discovery rate of 5% and the same 
minimum fold- change of ± 1.5. 

R scripts and data archives can be downloaded from. 
https://catissuesuite.ecmc.ed.ac.uk/caarray/ 

Additional material 
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